Not So Randomly Typing Monkeys – Rank-frequency Behavior of Natural and Artificial Languages

Authors

  • Andreas Krause
  • Andreas Zollmann
Abstract

Power laws arise through many natural processes. Zipf showed that the frequencies of words, as they appear in Shakespeare’s Hamlet, follow a power-law distribution. Mandelbrot explained this effect as the result of an underlying information-theoretic optimization problem. Miller cast doubt on this explanation by showing that a very simple mechanism could also account for the presence of power laws: a monkey typing words with uniformly and independently selected letters would also produce word frequencies following a power law. Consequently, several other researchers proposed and investigated rank-frequency distributions of randomly generated text. In this paper, we first present an overview of the literature on this topic. We then propose a class of Hidden Markov Models (HMMs) which generalizes the models previously investigated, generating power-law, log-normal and other behavior. We extend a result of Conrad and Mitzenmacher for computing the power-law exponent of zero-order Markov processes to a setting which captures random walks in d-regular graphs. In an extensive empirical evaluation, we investigate the convergence of rank-frequency distributions for randomly generated text to those of natural language corpora, for increasing orders of Markov processes and HMMs with an increasing number of hidden states. Our analysis uses four real-world corpora: the Reuters corpus, Shakespeare’s Hamlet, Goethe’s Faust and the source code of the Linux kernel.
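The random-typing mechanism attributed to Miller in the abstract can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code: the function names, the seed, and the four-letter alphabet in the usage line are assumptions chosen for illustration. It assumes every key, including the space bar, is pressed with equal probability; under that idealized assumption the power-law exponent for an alphabet of n letters works out to log(n + 1) / log(n).

```python
import math
import random
from collections import Counter


def monkey_text(num_chars, alphabet="abcdefghijklmnopqrstuvwxyz", seed=0):
    """Type num_chars keys, each drawn uniformly from the alphabet plus a space."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    keys = alphabet + " "
    return "".join(rng.choice(keys) for _ in range(num_chars))


def rank_frequency(text):
    """Return word counts ordered from the most to the least frequent word."""
    counts = Counter(text.split())  # split() discards empty words between spaces
    return sorted(counts.values(), reverse=True)


def uniform_exponent(num_letters):
    """Idealized exponent when all letters and the space are equally likely."""
    return math.log(num_letters + 1) / math.log(num_letters)


# Hypothetical usage: a small alphabet makes the pattern visible in a short text.
freqs = rank_frequency(monkey_text(200_000, alphabet="abcd"))
print(freqs[:5])
print(uniform_exponent(26))
```

Plotting `freqs` against rank on log-log axes should show a staircase whose envelope is roughly a straight line, because all words of the same length are equally likely in this uniform setting.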


Similar resources

Random texts exhibit Zipf's-law-like word frequency distribution

It is shown that the distribution of word frequencies for randomly generated texts is very similar to Zipf's law observed in natural languages such as English. The facts that the frequency of occurrence of a word is almost an inverse power law function of its rank and the exponent of this inverse power law is very close to 1 are largely due to the transformation from the word's length to it...


Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on clustering English documents has been carried out. This raises the question of whether those results extend to other languages. Since the goal of document clustering is grouping of docum...


The effect of 12 Hz Extremely Low-Frequency Electromagnetic Fields on Visual Memory of Male Macaque Monkeys

Introduction: Today, humans live in a world surrounded by electromagnetic fields. Numerous studies have been carried out to discover the biological, physiological, and behavioral effects of electromagnetic fields on humans and animals. Given the biological similarities between monkeys and humans, the goal of the present research was to examine Visual Memory (VM), hormonal, genomic, and anatomi...


Why verb-initial languages are not so frequent

In our simulations with simple recurrent networks, we demonstrate that small artificial languages are learnt differently depending on their basic word order. We show that verb-initial languages are difficult to learn, reflecting the lower frequency of verb-initial natural languages. We try to go beyond mere simulations by proposing two objective mathematical measures to explain our results.


Detecting and Predicting Muscle Fatigue during Typing By SEMG Signal Processing and Artificial Neural Networks

Introduction: Repetitive strain injuries are among the most prevalent occupational diseases. Repetition, vibration and bad postures of the extremities are work-related physical risk factors that can cause chronic musculoskeletal disorders. Repetitive work on a computer with low-level contraction requires the posture to be maintained for a long time, which can cause muscle fatigu...




Publication date: 2005